Statistical Analysis of AI Salaries (2025) 🤖
Using an AI jobs dataset
Author: Tayfur Akkaya Clavijo
Project: Statistical Analysis & Salary Modeling in the AI Job Market (2025)
Analysis objective:
The purpose of this study is to model and predict salaries for professionals in the AI job market, ensuring statistical excellence and scientific rigor.
Context: Final project for an advanced Statistics for Data Science course.
Table of Contents¶
- 1. Data Loading and Setup
- 2. Exploratory Data Analysis (EDA)
  - 2.1 Initial Review
  - 2.2 Visualizing Variable Distributions
    - 2.2.1 Target Variable
    - 2.2.2 Predictor Variables
- 3. Outliers
  - 3.1 Identification
  - 3.2 Treatment
- 4. Statistical Inference
  - 4.1 Normality Assessment
  - 4.2 Homoscedasticity Assessment (Equal Variances)
  - 4.3 Overall Analysis: Kruskal–Wallis
  - 4.4 Specific Analysis (Post-hoc): Quantifying Differences (Bootstrapping)
- 5. Correlation and Association Analysis
- 6. Predictive Modeling: Multiple Linear Regression
  - 6.1 Variable Preparation: Dummy Encoding
  - 6.2 Model Selection: Linear vs. Logarithmic
  - 6.3 Detailed Interpretation of Statistical Results
  - 6.4 Prediction and Confidence Intervals
  - 6.5 3D Visualization (Error Visualization)
  - 6.6 Model Uncertainty Analysis
- 7. Final Conclusions
# Imports
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from scipy import stats
import statsmodels.api as sm
from scipy.stats import bootstrap
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from tqdm import tqdm
import kagglehub
import os
# Aesthetic configuration for plots
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)
# Download the latest version of the dataset
path = kagglehub.dataset_download("pratyushpuri/global-ai-job-market-trend-2025")
print("Path to dataset files:", path)
# Exact name of the CSV file
csv_file_name = "ai_job_dataset.csv"
# Combine the folder path with the CSV file name
full_file_path = os.path.join(path, csv_file_name)
# Ensure that the CSV file has been downloaded and loaded correctly
try:
    # Read the CSV file using the full path
    # Apply 'parse_dates' to the date columns
    df = pd.read_csv(
        full_file_path,
        parse_dates=['posting_date', 'application_deadline']
    )
    print(
        f"✅ Dataset '{csv_file_name}' successfully loaded: "
        f"{df.shape[0]} rows, {df.shape[1]} columns."
    )
except FileNotFoundError:
    print(
        f"⚠️ ERROR: '{csv_file_name}' not found. "
        "Please make sure the download process completed correctly."
    )
Path to dataset files: C:\Users\tayfu\.cache\kagglehub\datasets\pratyushpuri\global-ai-job-market-trend-2025\versions\1 ✅ Dataset 'ai_job_dataset.csv' successfully loaded: 15000 rows, 19 columns.
import plotly.io as pio
# This option forces Plotly to be saved as a full HTML block inside the notebook file
pio.renderers.default = "notebook"
2. Exploratory Data Analysis (EDA)¶
Our first goal is to understand the data. To do so, we will build a data dictionary based on our observations and by visualizing the distribution of the variables.
| Field Name | Description | Data Type | Format / Example | Possible Values / Range | Nulls Allowed | Notes | Semantic Type |
|---|---|---|---|---|---|---|---|
| job_id | Unique identifier for each job posting | Categorical (string) | 'AI00001' | N/A | No | Primary key, no duplicates. | Categorical |
| job_title | Job title or role advertised | Categorical (string) | 'AI Research Scientist' | 20 unique values | No | Includes all possible roles. 20 unique values. | Categorical |
| salary_usd | Annual salary converted to USD | Integer (int) | 90,376 | ≥ 0 | No | Salary expressed in US dollars. | Ratio |
| salary_currency | Currency in which the salary is offered | Categorical (string) | 'USD' | 'USD', 'EUR', 'GBP' | No | USD = US Dollar, EUR = Euro, GBP = Pound Sterling. | Categorical |
| experience_level | Required level of professional experience | Categorical (string) | 'SE' | 'EN', 'MI', 'SE', 'EX' | No | EN = Entry, MI = Mid, SE = Senior, EX = Executive. | Categorical / Ordinal |
| employment_type | Type of employment contract | Categorical (string) | 'CT' | 'FT', 'PT', 'CT', 'FL' | No | CT = Contract, FT = Full-time, PT = Part-time, FL = Freelance. | Categorical |
| company_location | Country where the company is based | Categorical (string) | 'China' | 20 unique values | No | 20 unique country values. | Categorical |
| company_size | Size of the company | Categorical (string) | 'S' | 'S', 'M', 'L' | No | S = Small, M = Medium, L = Large. | Categorical / Ordinal |
| employee_residence | Country of employee residence | Categorical (string) | 'China' | 20 unique values | No | 20 unique country values. | Categorical |
| remote_ratio | Percentage of remote work allowed | Integer (int) | 50 | {0, 50, 100} | No | 0 = On-site, 50 = Hybrid, 100 = Fully remote. | Ratio |
| required_skills | Key technical skills required (comma-separated) | Categorical (string) | 'Python' | 13,663 unique values | No | Skills are comma-separated. 13,663 unique values. | Categorical |
| education_required | Minimum education level required | Categorical (string) | 'Associate' | 'Associate', 'Bachelor', 'Master', 'PhD' | No | 4 unique education levels. | Categorical / Ordinal |
| years_experience | Minimum years of experience required | Integer (int) | 4 | 0 – 20 | No | Discrete numeric values. | Ratio |
| industry | Industry sector of the job | Categorical (string) | 'Media' | 15 unique values | No | 15 unique industry sectors. | Categorical |
| posting_date | Date when the job was posted | Date | '2024-10-18' | N/A | No | ISO format: YYYY-MM-DD. | Interval |
| application_deadline | Final application deadline | Date | '2024-11-07' | N/A | No | ISO format: YYYY-MM-DD. | Interval |
| job_description_length | Length of the job description (characters) | Integer (int) | 1,076 | N/A | No | Number of characters in the description text. | Ratio |
| benefits_score | Numeric score reflecting benefits quality | Float | 9.4 | 0 – 10 | No | Higher values indicate better benefits. | Ratio |
| company_name | Name of the hiring company | Categorical (string) | 'Smart Analytics' | 16 unique values | No | 16 unique company names. | Categorical |
2.1 Initial Review¶
# Quick overview of the DataFrame
df.head()
| job_id | job_title | salary_usd | salary_currency | experience_level | employment_type | company_location | company_size | employee_residence | remote_ratio | required_skills | education_required | years_experience | industry | posting_date | application_deadline | job_description_length | benefits_score | company_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AI00001 | AI Research Scientist | 90376 | USD | SE | CT | China | M | China | 50 | Tableau, PyTorch, Kubernetes, Linux, NLP | Bachelor | 9 | Automotive | 2024-10-18 | 2024-11-07 | 1076 | 5.9 | Smart Analytics |
| 1 | AI00002 | AI Software Engineer | 61895 | USD | EN | CT | Canada | M | Ireland | 100 | Deep Learning, AWS, Mathematics, Python, Docker | Master | 1 | Media | 2024-11-20 | 2025-01-11 | 1268 | 5.2 | TechCorp Inc |
| 2 | AI00003 | AI Specialist | 152626 | USD | MI | FL | Switzerland | L | South Korea | 0 | Kubernetes, Deep Learning, Java, Hadoop, NLP | Associate | 2 | Education | 2025-03-18 | 2025-04-07 | 1974 | 9.4 | Autonomous Tech |
| 3 | AI00004 | NLP Engineer | 80215 | USD | SE | FL | India | M | India | 50 | Scala, SQL, Linux, Python | PhD | 7 | Consulting | 2024-12-23 | 2025-02-24 | 1345 | 8.6 | Future Systems |
| 4 | AI00005 | AI Consultant | 54624 | EUR | EN | PT | France | S | Singapore | 100 | MLOps, Java, Tableau, Python | Master | 0 | Media | 2025-04-15 | 2025-06-23 | 1989 | 6.6 | Advanced Robotics |
We can extract the number of rows and columns with shape:
f,c = df.shape
print(f"The DataFrame contains {f} rows and {c} columns")
The DataFrame contains 15000 rows and 19 columns
We run info():
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 15000 entries, 0 to 14999 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 job_id 15000 non-null object 1 job_title 15000 non-null object 2 salary_usd 15000 non-null int64 3 salary_currency 15000 non-null object 4 experience_level 15000 non-null object 5 employment_type 15000 non-null object 6 company_location 15000 non-null object 7 company_size 15000 non-null object 8 employee_residence 15000 non-null object 9 remote_ratio 15000 non-null int64 10 required_skills 15000 non-null object 11 education_required 15000 non-null object 12 years_experience 15000 non-null int64 13 industry 15000 non-null object 14 posting_date 15000 non-null datetime64[ns] 15 application_deadline 15000 non-null datetime64[ns] 16 job_description_length 15000 non-null int64 17 benefits_score 15000 non-null float64 18 company_name 15000 non-null object dtypes: datetime64[ns](2), float64(1), int64(4), object(12) memory usage: 2.2+ MB
df.info() only prints its summary to the screen and does not return a manipulable object (it returns None). To be able to process this information programmatically, we will build our own info_df(), which produces the same kind of overview as a DataFrame we can work with:
def info_df(df):
    return pd.DataFrame({
        'Column': df.columns,
        'Non-Null': df.notnull().sum().values,
        'Null': df.isnull().sum().values,
        'Python Type': df.dtypes.values,
        'Num. Unique Values': [len(df[col].unique()) for col in df.columns],
        'Sample Values': [df[col].unique()[:5] for col in df.columns],
    })
info = info_df(df)
info
| Column | Non-Null | Null | Python Type | Num. Unique Values | Sample Values | |
|---|---|---|---|---|---|---|
| 0 | job_id | 15000 | 0 | object | 15000 | [AI00001, AI00002, AI00003, AI00004, AI00005] |
| 1 | job_title | 15000 | 0 | object | 20 | [AI Research Scientist, AI Software Engineer, ... |
| 2 | salary_usd | 15000 | 0 | int64 | 14315 | [90376, 61895, 152626, 80215, 54624] |
| 3 | salary_currency | 15000 | 0 | object | 3 | [USD, EUR, GBP] |
| 4 | experience_level | 15000 | 0 | object | 4 | [SE, EN, MI, EX] |
| 5 | employment_type | 15000 | 0 | object | 4 | [CT, FL, PT, FT] |
| 6 | company_location | 15000 | 0 | object | 20 | [China, Canada, Switzerland, India, France] |
| 7 | company_size | 15000 | 0 | object | 3 | [M, L, S] |
| 8 | employee_residence | 15000 | 0 | object | 20 | [China, Ireland, South Korea, India, Singapore] |
| 9 | remote_ratio | 15000 | 0 | int64 | 3 | [50, 100, 0] |
| 10 | required_skills | 15000 | 0 | object | 13663 | [Tableau, PyTorch, Kubernetes, Linux, NLP, Dee... |
| 11 | education_required | 15000 | 0 | object | 4 | [Bachelor, Master, Associate, PhD] |
| 12 | years_experience | 15000 | 0 | int64 | 20 | [9, 1, 2, 7, 0] |
| 13 | industry | 15000 | 0 | object | 15 | [Automotive, Media, Education, Consulting, Hea... |
| 14 | posting_date | 15000 | 0 | datetime64[ns] | 486 | [2024-10-18 00:00:00, 2024-11-20 00:00:00, 202... |
| 15 | application_deadline | 15000 | 0 | datetime64[ns] | 543 | [2024-11-07 00:00:00, 2025-01-11 00:00:00, 202... |
| 16 | job_description_length | 15000 | 0 | int64 | 2000 | [1076, 1268, 1974, 1345, 1989] |
| 17 | benefits_score | 15000 | 0 | float64 | 51 | [5.9, 5.2, 9.4, 8.6, 6.6] |
| 18 | company_name | 15000 | 0 | object | 16 | [Smart Analytics, TechCorp Inc, Autonomous Tec... |
df.describe(percentiles=[0.25, 0.5, 0.75, 0.95, 0.99]).T
| count | mean | min | 25% | 50% | 75% | 95% | 99% | max | std | |
|---|---|---|---|---|---|---|---|---|---|---|
| salary_usd | 15000.0 | 115348.965133 | 32519.0 | 70179.75 | 99705.0 | 146408.5 | 237987.95 | 307775.69 | 399095.0 | 60260.940438 |
| remote_ratio | 15000.0 | 49.483333 | 0.0 | 0.0 | 50.0 | 100.0 | 100.0 | 100.0 | 100.0 | 40.812712 |
| years_experience | 15000.0 | 6.2532 | 0.0 | 2.0 | 5.0 | 10.0 | 17.0 | 19.0 | 19.0 | 5.545768 |
| posting_date | 15000 | 2024-08-29 08:48:51.840000 | 2024-01-01 00:00:00 | 2024-04-29 00:00:00 | 2024-08-28 00:00:00 | 2024-12-29 00:00:00 | 2025-04-06 00:00:00 | 2025-04-26 00:00:00 | 2025-04-30 00:00:00 | NaN |
| application_deadline | 15000 | 2024-10-11 21:55:23.520000 | 2024-01-16 00:00:00 | 2024-06-13 00:00:00 | 2024-10-12 00:00:00 | 2025-02-10 00:00:00 | 2025-05-21 00:00:00 | 2025-06-19 00:00:00 | 2025-07-11 00:00:00 | NaN |
| job_description_length | 15000.0 | 1503.314733 | 500.0 | 1003.75 | 1512.0 | 2000.0 | 2402.0 | 2479.0 | 2499.0 | 576.127083 |
| benefits_score | 15000.0 | 7.504273 | 5.0 | 6.2 | 7.5 | 8.8 | 9.8 | 9.9 | 10.0 | 1.45087 |
Caution about outliers (we will revisit this later).
df.describe(include="O").T
| count | unique | top | freq | |
|---|---|---|---|---|
| job_id | 15000 | 15000 | AI00001 | 1 |
| job_title | 15000 | 20 | Machine Learning Researcher | 808 |
| salary_currency | 15000 | 3 | USD | 11957 |
| experience_level | 15000 | 4 | MI | 3781 |
| employment_type | 15000 | 4 | FT | 3812 |
| company_location | 15000 | 20 | Germany | 814 |
| company_size | 15000 | 3 | S | 5007 |
| employee_residence | 15000 | 20 | Sweden | 790 |
| required_skills | 15000 | 13663 | Python, TensorFlow, PyTorch | 17 |
| education_required | 15000 | 4 | Bachelor | 3789 |
| industry | 15000 | 15 | Retail | 1063 |
| company_name | 15000 | 16 | TechCorp Inc | 980 |
Some observations we can already see are:
- Top job title: Machine Learning Researcher
- Top salary currency: USD
- Top experience level: MI — Mid level
- Top employment type: FT — Full time
- Top company size: S — Small
- Top employee residence: Sweden
- Top required skills (raw field): Python, TensorFlow, PyTorch
- Top education required: Bachelor's
- Top industry: Retail
In required_skills, the top value returns multiple skills. This happens because it is returning the most frequent combination, which is not the same as the most demanded individual skills.
Therefore, we will explode the skills list to get a clearer picture of which skills are most requested—an interesting complement to the analysis.
# Convert NaN to an empty string (we don't have any, but we include this as a best practice)
skills_series = df['required_skills'].fillna('')
# Split by commas -> converts each row into a list
skills_lists = skills_series.str.split(',')
# Strip whitespace around each skill
skills_lists = skills_lists.apply(lambda lst: [s.strip() for s in lst if s.strip() != ''])
# Explode -> each skill becomes its own row
skills_exploded = skills_lists.explode()
# Count frequencies
top_skills = skills_exploded.value_counts().head(10)
print("Top most in-demand skills:")
print(top_skills)
Top most in-demand skills: required_skills Python 4450 SQL 3407 TensorFlow 3022 Kubernetes 3009 Scala 2794 PyTorch 2777 Linux 2705 Git 2631 Java 2578 GCP 2442 Name: count, dtype: int64
We see very relevant, widely used skills in this industry: first Python, then SQL, and third TensorFlow—highly relevant for the sector.
Now, as a best practice, we will convert the ordinal columns to ordered categories (ordered=True) and the remaining nominal columns to unordered categories (ordered=False), except job_id, job_title, and required_skills, which we keep as objects for convenience.
from pandas.api.types import CategoricalDtype
# Define the logical ordering for each ordinal variable
size_order = CategoricalDtype(categories=['S', 'M', 'L'], ordered=True)
# Assume this order: Entry -> Mid -> Senior -> Executive
exp_order = CategoricalDtype(categories=['EN', 'MI', 'SE', 'EX'], ordered=True)
# Assume this academic order
edu_order = CategoricalDtype(categories=['Associate', 'Bachelor', 'Master', 'PhD'], ordered=True)
# Apply the transformations
df['company_size'] = df['company_size'].astype(size_order)
df['experience_level'] = df['experience_level'].astype(exp_order)
df['education_required'] = df['education_required'].astype(edu_order)
nominal_cols = [
'employment_type', 'salary_currency', 'industry',
'company_location', 'employee_residence', 'company_name'
]
for col in nominal_cols:
df[col] = df[col].astype('category')
# Since we do not specify 'ordered=True', Pandas treats these as unordered categories
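One practical payoff of ordered categories is that they support element-wise order comparisons and a meaningful min/max. A minimal sketch on a small synthetic series (not the dataset itself), reusing the same EN < MI < SE < EX ordering:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same ordering as applied to experience_level above
exp_order = CategoricalDtype(categories=['EN', 'MI', 'SE', 'EX'], ordered=True)

levels = pd.Series(['SE', 'EN', 'MI', 'EX'], dtype=exp_order)

# Ordered categories support comparisons against a category label
senior_or_above = levels >= 'SE'
print(senior_or_above.tolist())  # [True, False, False, True]

# min/max follow the declared order, not alphabetical order
print(levels.min(), levels.max())  # EN EX
```

On the real data, the same idea lets us filter with, for example, df[df['experience_level'] >= 'SE'].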
This is how it looks now in info:
info = info_df(df)
info
| Column | Non-Null | Null | Python Type | Num. Unique Values | Sample Values | |
|---|---|---|---|---|---|---|
| 0 | job_id | 15000 | 0 | object | 15000 | [AI00001, AI00002, AI00003, AI00004, AI00005] |
| 1 | job_title | 15000 | 0 | object | 20 | [AI Research Scientist, AI Software Engineer, ... |
| 2 | salary_usd | 15000 | 0 | int64 | 14315 | [90376, 61895, 152626, 80215, 54624] |
| 3 | salary_currency | 15000 | 0 | category | 3 | ['USD', 'EUR', 'GBP'] Categories (3, object): ... |
| 4 | experience_level | 15000 | 0 | category | 4 | ['SE', 'EN', 'MI', 'EX'] Categories (4, object... |
| 5 | employment_type | 15000 | 0 | category | 4 | ['CT', 'FL', 'PT', 'FT'] Categories (4, object... |
| 6 | company_location | 15000 | 0 | category | 20 | ['China', 'Canada', 'Switzerland', 'India', 'F... |
| 7 | company_size | 15000 | 0 | category | 3 | ['M', 'L', 'S'] Categories (3, object): ['S' <... |
| 8 | employee_residence | 15000 | 0 | category | 20 | ['China', 'Ireland', 'South Korea', 'India', '... |
| 9 | remote_ratio | 15000 | 0 | int64 | 3 | [50, 100, 0] |
| 10 | required_skills | 15000 | 0 | object | 13663 | [Tableau, PyTorch, Kubernetes, Linux, NLP, Dee... |
| 11 | education_required | 15000 | 0 | category | 4 | ['Bachelor', 'Master', 'Associate', 'PhD'] Cat... |
| 12 | years_experience | 15000 | 0 | int64 | 20 | [9, 1, 2, 7, 0] |
| 13 | industry | 15000 | 0 | category | 15 | ['Automotive', 'Media', 'Education', 'Consulti... |
| 14 | posting_date | 15000 | 0 | datetime64[ns] | 486 | [2024-10-18 00:00:00, 2024-11-20 00:00:00, 202... |
| 15 | application_deadline | 15000 | 0 | datetime64[ns] | 543 | [2024-11-07 00:00:00, 2025-01-11 00:00:00, 202... |
| 16 | job_description_length | 15000 | 0 | int64 | 2000 | [1076, 1268, 1974, 1345, 1989] |
| 17 | benefits_score | 15000 | 0 | float64 | 51 | [5.9, 5.2, 9.4, 8.6, 6.6] |
| 18 | company_name | 15000 | 0 | category | 16 | ['Smart Analytics', 'TechCorp Inc', 'Autonomou... |
If the number of distinct values is not too large, we can display each value and its frequency to examine the variables in more detail:
for col in df.columns:
    if len(df[col].unique()) < 45:
        print(df[col].value_counts())
        print("=" * 40)
job_title Machine Learning Researcher 808 AI Software Engineer 784 Autonomous Systems Engineer 777 Machine Learning Engineer 772 AI Architect 771 Head of AI 765 NLP Engineer 762 Robotics Engineer 759 Data Analyst 759 AI Research Scientist 756 Data Engineer 749 AI Product Manager 743 Research Scientist 742 Principal Data Scientist 734 AI Specialist 728 ML Ops Engineer 725 Computer Vision Engineer 724 Data Scientist 720 Deep Learning Engineer 718 AI Consultant 704 Name: count, dtype: int64 ======================================== salary_currency USD 11957 EUR 2314 GBP 729 Name: count, dtype: int64 ======================================== experience_level MI 3781 EX 3760 SE 3741 EN 3718 Name: count, dtype: int64 ======================================== employment_type FT 3812 FL 3758 CT 3721 PT 3709 Name: count, dtype: int64 ======================================== company_location Germany 814 Denmark 778 Canada 769 France 769 Austria 765 Singapore 764 China 763 India 754 Sweden 752 Israel 751 Ireland 750 Switzerland 746 Finland 733 Japan 733 Australia 732 Netherlands 731 United Kingdom 729 United States 724 South Korea 722 Norway 721 Name: count, dtype: int64 ======================================== company_size S 5007 L 4998 M 4995 Name: count, dtype: int64 ======================================== employee_residence Sweden 790 France 781 Denmark 777 Austria 776 India 772 Germany 769 South Korea 763 Canada 762 China 761 Netherlands 758 United Kingdom 750 Switzerland 748 Ireland 740 Singapore 740 Israel 731 Australia 730 Norway 726 United States 716 Finland 710 Japan 700 Name: count, dtype: int64 ======================================== remote_ratio 0 5075 50 5005 100 4920 Name: count, dtype: int64 ======================================== education_required Bachelor 3789 Associate 3785 Master 3748 PhD 3678 Name: count, dtype: int64 ======================================== years_experience 0 1890 1 1828 4 1295 3 1247 2 1239 7 769 5 757 6 753 9 742 8 720 16 403 15 392 
13 391 10 384 19 378 11 373 14 364 17 363 12 362 18 350 Name: count, dtype: int64 ======================================== industry Retail 1063 Media 1045 Automotive 1020 Consulting 1020 Technology 1011 Real Estate 1007 Government 998 Healthcare 997 Telecommunications 997 Transportation 997 Finance 984 Energy 976 Gaming 967 Manufacturing 962 Education 956 Name: count, dtype: int64 ======================================== company_name TechCorp Inc 980 Cognitive Computing 972 AI Innovations 964 Digital Transformation LLC 961 Future Systems 960 Quantum Computing Inc 960 Cloud AI Solutions 951 Predictive Systems 947 Smart Analytics 927 Advanced Robotics 925 Machine Intelligence Group 922 Neural Networks Co 922 Autonomous Tech 918 DataVision Ltd 909 DeepTech Ventures 897 Algorithmic Solutions 885 Name: count, dtype: int64 ========================================
Overall, the category frequencies within each variable are fairly balanced; no category dominates strongly.
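This impression can be backed with a number via a chi-square goodness-of-fit test against a uniform distribution. A sketch using the experience_level counts printed above (the test itself is an addition for illustration, not part of the original pipeline):

```python
import numpy as np
from scipy import stats

# Observed experience_level counts from the value_counts output above (MI, EX, SE, EN)
observed = np.array([3781, 3760, 3741, 3718])

# Under the null hypothesis, all four levels are equally frequent
expected = np.full(4, observed.sum() / 4)

chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
# A large p-value means uniformity cannot be rejected: the levels are balanced
```

With these counts the p-value is far above 0.05, consistent with the near-uniform frequencies seen in the tables.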
df.duplicated().sum() # no duplicates
0
Code to compute the mode of each column and its relevance:
# Show only the first (most frequent) value in each column, along with its count
header = f"{'Column':<25} {'Mode':<30} {'Frequency':<15} {'% of non-nulls':<15}"
print(header)
print("-" * len(header))
for col in df.columns:
    frequencies = df[col].value_counts()  # value_counts drops nulls (we have none, but good practice)
    mode_value = frequencies.index[0] if not frequencies.empty else None
    mode_count = frequencies.iloc[0] if not frequencies.empty else 0
    non_null_count = df[col].notnull().sum()
    mode_percentage = (mode_count / non_null_count * 100) if non_null_count > 0 else 0
    print(f"{col:<25} {str(mode_value):<30} {mode_count:<15} {round(mode_percentage, 2):<15}")
Column Mode Frequency % of non-nulls ---------------------------------------------------------------------------------------- job_id AI00001 1 0.01 job_title Machine Learning Researcher 808 5.39 salary_usd 67253 4 0.03 salary_currency USD 11957 79.71 experience_level MI 3781 25.21 employment_type FT 3812 25.41 company_location Germany 814 5.43 company_size S 5007 33.38 employee_residence Sweden 790 5.27 remote_ratio 0 5075 33.83 required_skills Python, TensorFlow, PyTorch 17 0.11 education_required Bachelor 3789 25.26 years_experience 0 1890 12.6 industry Retail 1063 7.09 posting_date 2024-07-05 00:00:00 51 0.34 application_deadline 2025-01-05 00:00:00 47 0.31 job_description_length 1519 19 0.13 benefits_score 9.9 338 2.25 company_name TechCorp Inc 980 6.53
Consistent with what we have seen so far.
# Code to get numeric columns
numeric_columns = df.select_dtypes(include=['number']).columns.tolist()
header = f"{'Column':<30} {'Mean':<10} {'Median':<10}"
print(header)
print("-" * len(header))
for col in numeric_columns:
    mean = df[col].mean()
    median = df[col].median()
    print(f"{col:<30} {round(mean, 2):<10} {round(median, 2):<10}")
Column Mean Median ---------------------------------------------------- salary_usd 115348.97 99705.0 remote_ratio 49.48 50.0 years_experience 6.25 5.0 job_description_length 1503.31 1512.0 benefits_score 7.5 7.5
When the mean differs markedly from the median (as with salary_usd), the distribution is strongly skewed toward the side of the mean; here the mean exceeds the median, indicating a right skew. Since the mean is more affected by extreme values, the median is the preferred, more robust measure of central tendency. In our case, we can infer the presence of very high salaries.
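The mean-median gap can be turned into a single number via the nonparametric skew, (mean − median) / std, which is positive for right-skewed data. A sketch on synthetic log-normal data standing in for salary_usd (the stand-in distribution and its parameters are an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic right-skewed stand-in for salary_usd (log-normal; parameters
# chosen only for illustration)
salaries = rng.lognormal(mean=11.5, sigma=0.5, size=15_000)

mean, median, std = salaries.mean(), np.median(salaries), salaries.std()

# Nonparametric skew: positive when the mean lies to the right of the median
nonparametric_skew = (mean - median) / std
print(f"mean = {mean:,.0f}, median = {median:,.0f}, skew = {nonparametric_skew:.3f}")
```

On the real data, the same statistic is (df['salary_usd'].mean() - df['salary_usd'].median()) / df['salary_usd'].std().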
To analyze our variables of interest in more detail, we will plot them next.
2.2 Visualizing Variable Distributions¶
We will plot our target variable salary_usd to see its distribution. We will also visualize how salary behaves with respect to our variables of interest (experience_level and company_size), which will be our candidate predictors.
2.2.1 Target Variable¶
For the histogram of salary_usd (our target variable), we will take one additional step to compute the optimal number of bins.
We choose the Freedman–Diaconis rule to determine the optimal number of bins in the histogram because our data shows significant positive skewness. This rule uses the interquartile range (IQR), which makes it robust to extreme values and suitable for real-world, non-normal distributions.
It also adapts dynamically to the sample size and avoids selecting an arbitrary number of bins, providing a more faithful visualization of the distribution.
We can do this automatically with bins="auto" (first alternative in the next code cell). Using this approach, two candidate bin counts are computed:
- Sturges → well suited to near-normal data
- Freedman–Diaconis → well suited to skewed distributions
Matplotlib then uses whichever rule yields more bins.
In our case, the distribution is positively skewed, the IQR captures that spread, and Freedman–Diaconis detects the asymmetry and yields many more bins than Sturges, so it is the rule that gets selected.
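This "larger of the two" behavior can be checked directly with NumPy's histogram_bin_edges, which is the machinery behind bins='auto'. A sketch on synthetic right-skewed data (the stand-in distribution is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed stand-in for the salary column
data = rng.lognormal(mean=11.5, sigma=0.5, size=15_000)

# Number of bins proposed by each rule (edge count minus one)
n_sturges = len(np.histogram_bin_edges(data, bins='sturges')) - 1
n_fd = len(np.histogram_bin_edges(data, bins='fd')) - 1
n_auto = len(np.histogram_bin_edges(data, bins='auto')) - 1

print(f"Sturges: {n_sturges} bins, Freedman-Diaconis: {n_fd} bins, auto: {n_auto} bins")
```

For skewed data like this, the Freedman–Diaconis count is far larger than the Sturges count, and 'auto' matches the FD count.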
Or we can do it manually (second code cell).
# Automatic alternative using bins='auto'
# 'auto' usually uses Freedman–Diaconis when the dataset is large (we have ~15,000 rows)
sns.histplot(df['salary_usd'], bins='auto', kde=True)
plt.title('Salary distribution')
plt.xlabel('Salary (USD)')
plt.show()
# Manual alternative using Freedman–Diaconis
# Compute the interquartile range (IQR)
q1 = df['salary_usd'].quantile(0.25)
q3 = df['salary_usd'].quantile(0.75)
iqr = q3 - q1
# Compute bin width (Freedman–Diaconis rule)
# h = 2 * IQR * n^(-1/3)
data_len = len(df['salary_usd'])
bin_width = 2 * iqr * (data_len ** (-1/3))
# Compute the number of bins
data_range = df['salary_usd'].max() - df['salary_usd'].min()
num_bins = int(data_range / bin_width)
print(f"Optimal number of bins (Freedman–Diaconis): {num_bins}")
# Plot using the computed number of bins
sns.histplot(df['salary_usd'], bins=num_bins, kde=True)
plt.title('Salary distribution')
plt.xlabel('Salary (USD)')
plt.show()
Optimal number of bins (Freedman–Diaconis): 59
We can see it is heavily right-skewed (a long tail of high salaries). Most salaries cluster in the low/medium range, but there is a long tail of very high salaries. This matters for the statistical tests.
Now let's look at the boxplot:
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['salary_usd'], color='lightgray')
plt.title('Salary distribution (USD)', fontsize=14)
plt.xlabel('Salary (USD)')
plt.show()
The boxplot shows that the salary distribution is clearly right-skewed. The median is around 100,000 USD.
Most salaries fall between the first and third quartiles, but there is a long tail with many very high values, shown as outliers. This confirms that salary_usd has high variability and extreme salaries.
This suggests we should treat outliers and/or consider a logarithmic scale in our regression model.
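To preview what a logarithmic scale buys us, we can compare the skewness coefficient before and after the transform. A sketch on synthetic right-skewed data (again a log-normal stand-in, an assumption for illustration; on the real data the same idea applies to np.log(df['salary_usd']), since all salaries are positive):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Synthetic right-skewed stand-in for salary_usd
salaries = rng.lognormal(mean=11.5, sigma=0.5, size=15_000)

# Salaries are strictly positive, so a plain log is safe
log_salaries = np.log(salaries)

print(f"skew before log: {stats.skew(salaries):.2f}")
print(f"skew after log : {stats.skew(log_salaries):.2f}")
# The log transform pulls in the right tail, yielding a far more symmetric shape
```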
2.2.2 Predictor Variables¶
Now we will visualize each predictor individually, and then its relationship with salary.
# --- ANALYSIS OF PREDICTOR VARIABLE: experience_level ---
plt.figure(figsize=(10, 5))
# Univariate plot
el_count = df['experience_level'].value_counts()
sns.barplot(x=el_count.index, y=el_count.values)
plt.title('Frequency by experience level')
plt.xlabel('Experience level')
plt.ylabel('Employees')
ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
for i, v in enumerate(el_count.values):
plt.text(i, v + 10, str(v), ha='center', va='bottom')
plt.show()
The frequencies of the four experience levels are very similar, almost identical.
# Bivariate plot (X vs Y)
plt.figure(figsize=(10, 6))
sns.boxplot(x='experience_level', y='salary_usd', data=df)
plt.title('Salary distribution by experience level', fontsize=14)
plt.xlabel('Experience level')
plt.ylabel('Salary (USD)')
plt.show()
The boxplot confirms our suspicion: as experience increases, the median salary increases.
We also clearly see points above the whiskers—those are the outliers we had already noticed.
So the boxplot helps us explain how the salary median (the central line) increases with experience level, and it also sets up the next step about outliers.
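The medians read off the boxplot can also be computed directly with a grouped aggregation. A toy sketch with the same column names (the numbers below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame with the same column names; values are illustrative only
toy = pd.DataFrame({
    'experience_level': ['EN', 'EN', 'MI', 'MI', 'SE', 'SE', 'EX', 'EX'],
    'salary_usd': [50_000, 60_000, 80_000, 90_000,
                   120_000, 130_000, 170_000, 190_000],
})

# Median salary per experience level
medians = toy.groupby('experience_level')['salary_usd'].median()
print(medians.sort_values())
```

On the real data: df.groupby('experience_level', observed=True)['salary_usd'].median().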
# --- ANALYSIS OF PREDICTOR VARIABLE: company_size ---
plt.figure(figsize=(10, 5))
# Univariate plot
cs_count = df['company_size'].value_counts()
sns.barplot(x=cs_count.index, y=cs_count.values)
plt.title('Frequency by company size')
plt.xlabel('Company size')
plt.ylabel('Employees')
ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
for i, v in enumerate(cs_count.values):
plt.text(i, v + 10, str(v), ha='center', va='bottom')
plt.show()
The three company sizes likewise have very similar, almost identical frequencies.
# Bivariate plot (X vs Y)
plt.figure(figsize=(10, 6))
sns.boxplot(x='company_size', y='salary_usd', data=df)
plt.title('Salary distribution by company size', fontsize=14)
plt.xlabel('Company size')
plt.ylabel('Salary (USD)')
plt.show()
The boxplot suggests that the larger the company, the higher the median salary. We also again clearly see points above the whiskers (outliers).
So the boxplot helps us explain how the median salary increases with company size.
3. Outliers¶
Salaries often contain extreme values that can distort linear regression, which is sensitive to them.
In the previous section we identified outliers visually in our target variable. Now we will apply a numerical method to robustly locate outliers, and then decide and justify how we will treat them.
3.1 Identification¶
Following the same logic as the boxplots, but numerically, we will apply the interquartile range (IQR) method, also known as Tukey's fences; for a tabular problem like ours, it is an appropriate way to identify outliers in the target variable.
threshold = 1.5
Q1 = df['salary_usd'].quantile(0.25)
Q3 = df['salary_usd'].quantile(0.75)
IQR = Q3 - Q1
floor = Q1 - threshold * IQR
ceiling = Q3 + threshold * IQR
print(f"Interquartile Range (IQR): {IQR:,.0f}")
print(f"Tukey's fences: < {floor:,.0f} and > {ceiling:,.0f}")
Interquartile Range (IQR): 76,229 Tukey's fences: < -44,163 and > 260,752
3.2 Treatment¶
For data cleaning, we chose Tukey's statistical criterion (1.5·IQR) instead of setting an arbitrary or fixed threshold (such as the 99th percentile). This method is dynamic and allows us to detect salaries that deviate significantly from the central market distribution, ensuring that our regression model is trained on data representative of the average professional and is not biased by exceptional salaries.
In other words, while this excludes a real niche of high executives (C-suite > $260k), we prioritize the model’s generalization capability. This way, we prevent that corporate elite from distorting predictions for the vast majority of professionals in the sector.
# Identify how many points fall outside the fences
outliers = df[(df['salary_usd'] < floor) | (df['salary_usd'] > ceiling)]
print(f"Detected {len(outliers)} outliers ({len(outliers)/len(df):.2%} of the dataset).")
# Decision: filter them out
df_clean = df.drop(outliers.index).copy()
print("Treatment applied: removed records outside Tukey's fences.")
# Visualization after outlier treatment
plt.figure(figsize=(10, 4))
sns.boxplot(x=df_clean['salary_usd'])
plt.title('Distribution after removing outliers')
plt.show()
df_clean
Detected 483 outliers (3.22% of the dataset). Treatment applied: removed records outside Tukey's fences.
| job_id | job_title | salary_usd | salary_currency | experience_level | employment_type | company_location | company_size | employee_residence | remote_ratio | required_skills | education_required | years_experience | industry | posting_date | application_deadline | job_description_length | benefits_score | company_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AI00001 | AI Research Scientist | 90376 | USD | SE | CT | China | M | China | 50 | Tableau, PyTorch, Kubernetes, Linux, NLP | Bachelor | 9 | Automotive | 2024-10-18 | 2024-11-07 | 1076 | 5.9 | Smart Analytics |
| 1 | AI00002 | AI Software Engineer | 61895 | USD | EN | CT | Canada | M | Ireland | 100 | Deep Learning, AWS, Mathematics, Python, Docker | Master | 1 | Media | 2024-11-20 | 2025-01-11 | 1268 | 5.2 | TechCorp Inc |
| 2 | AI00003 | AI Specialist | 152626 | USD | MI | FL | Switzerland | L | South Korea | 0 | Kubernetes, Deep Learning, Java, Hadoop, NLP | Associate | 2 | Education | 2025-03-18 | 2025-04-07 | 1974 | 9.4 | Autonomous Tech |
| 3 | AI00004 | NLP Engineer | 80215 | USD | SE | FL | India | M | India | 50 | Scala, SQL, Linux, Python | PhD | 7 | Consulting | 2024-12-23 | 2025-02-24 | 1345 | 8.6 | Future Systems |
| 4 | AI00005 | AI Consultant | 54624 | EUR | EN | PT | France | S | Singapore | 100 | MLOps, Java, Tableau, Python | Master | 0 | Media | 2025-04-15 | 2025-06-23 | 1989 | 6.6 | Advanced Robotics |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14995 | AI14996 | Robotics Engineer | 38604 | USD | EN | FL | Finland | S | Finland | 50 | Java, Kubernetes, Azure | Bachelor | 1 | Energy | 2025-02-06 | 2025-03-25 | 1635 | 7.9 | Advanced Robotics |
| 14996 | AI14997 | Machine Learning Researcher | 57811 | GBP | EN | CT | United Kingdom | M | United Kingdom | 0 | Mathematics, Docker, SQL, Deep Learning | Master | 0 | Government | 2024-10-16 | 2024-10-30 | 1624 | 8.2 | Smart Analytics |
| 14997 | AI14998 | NLP Engineer | 189490 | USD | EX | CT | South Korea | L | South Korea | 50 | Scala, Spark, NLP | Associate | 17 | Manufacturing | 2024-03-19 | 2024-05-02 | 1336 | 7.4 | AI Innovations |
| 14998 | AI14999 | Head of AI | 79461 | EUR | EN | FT | Netherlands | M | Netherlands | 0 | Java, Computer Vision, Python, TensorFlow | PhD | 1 | Real Estate | 2024-03-22 | 2024-04-23 | 1935 | 5.6 | Smart Analytics |
| 14999 | AI15000 | Computer Vision Engineer | 56481 | USD | MI | PT | Austria | S | Austria | 50 | Scala, Azure, Deep Learning, GCP, Mathematics | PhD | 2 | Technology | 2024-07-18 | 2024-08-10 | 2492 | 7.6 | AI Innovations |
14517 rows × 19 columns
4. Statistical Inference¶
Before running regression, we should scientifically validate whether the differences we see in the plots are statistically significant.
In particular, before moving to predictive modeling, we assess whether observed salary differences across groups are not only statistically significant, but also practically meaningful. This distinction is crucial: large datasets can produce very small p-values even when real-world effects are modest.
So, to determine which variables truly influence salary and justify their inclusion in the future regression model, we will run a set of statistical tests.
4.1 Normality Assessment¶
The first step is to verify the distribution of the dependent variable (salary_usd). Since our dataset is large (N > 5,000, beyond the range for which SciPy's Shapiro–Wilk p-values are considered accurate), we discard the Shapiro–Wilk test and instead use the D’Agostino–Pearson test, complemented visually with Q–Q plots.
Our validation strategy has two levels:
- Global level: run a test on the full salary_usd variable to confirm statistically the skewness detected during EDA.
- Group level: check normality within each subgroup (e.g., the salary distribution only for Seniors).
- Decision: if normality fails at this critical step, we will discard parametric methods and use robust alternatives such as Kruskal–Wallis.
# GLOBAL ANALYSIS
print("GLOBAL NORMALITY TEST (Variable: salary_usd)")
stat, p_value = stats.normaltest(df_clean['salary_usd'])
print(f" Statistic: {stat:.4f}, p-value: {p_value:.4e}")
if p_value < 0.05:
print(" ✅ CONCLUSION: We reject global normality (consistent with the skewness seen in EDA).")
else:
print(" ❌ CONCLUSION: We cannot reject global normality.")
GLOBAL NORMALITY TEST (Variable: salary_usd) Statistic: 1287.3386, p-value: 2.8706e-280 ✅ CONCLUSION: We reject global normality (consistent with the skewness seen in EDA).
# GROUP-BASED ANALYSIS
def check_normality_groups(df, group_col, target_col):
# Use .cat.categories to respect our hierarchical ordering
unique_groups = df[group_col].cat.categories
# Plot configuration
# Dynamically adjust the figure size based on the number of groups
fig, axes = plt.subplots(1, len(unique_groups), figsize=(5 * len(unique_groups), 4))
print(f"\nNORMALITY TEST BY GROUPS: {group_col}")
for i, group in enumerate(unique_groups):
# Filter data for the specific group
data = df[df[group_col] == group][target_col]
# A. Statistical test (D'Agostino)
if len(data) >= 8: # Minimum required for the test; otherwise it would not be valid
stat, p_value = stats.normaltest(data)
normality = "FAIL TO REJECT NORMALITY" if p_value > 0.05 else "NOT NORMAL"
print(f" -> Group '{group}' (n={len(data)}): p-value={p_value:.4e} => {normality}")
else:
print(f" -> Group '{group}': Not enough data for a reliable test.")
# B. Q-Q Plot
ax = axes[i] if len(unique_groups) > 1 else axes
sm.qqplot(data, line='s', ax=ax, markerfacecolor='blue', markeredgecolor='blue', alpha=0.3)
ax.set_title(f'Q-Q Plot: {group}')
plt.tight_layout()
plt.show()
# Run for Experience Level
check_normality_groups(df_clean, 'experience_level', 'salary_usd')
# Run for Company Size
check_normality_groups(df_clean, 'company_size', 'salary_usd')
NORMALITY TEST BY GROUPS: experience_level -> Group 'EN' (n=3718): p-value=5.1094e-60 => NOT NORMAL -> Group 'MI' (n=3781): p-value=1.5729e-54 => NOT NORMAL -> Group 'SE' (n=3741): p-value=5.2348e-54 => NOT NORMAL -> Group 'EX' (n=3277): p-value=1.0541e-72 => NOT NORMAL
NORMALITY TEST BY GROUPS: company_size -> Group 'S' (n=4941): p-value=1.1475e-126 => NOT NORMAL -> Group 'M' (n=4855): p-value=2.7140e-108 => NOT NORMAL -> Group 'L' (n=4721): p-value=2.5776e-71 => NOT NORMAL
Interpretation: why “non-normal” and what does it imply?¶
Normality tests (such as D’Agostino) detect deviations from normality. In large datasets (like this one), these tests are very sensitive: even small deviations can produce extremely low p-values.
That’s why the correct reading is not only “it is not normal”, but:
- Consistency with EDA: the salary distribution shows positive skewness (right tail), consistent with very high salaries in a subset.
- Implication for inference: since normality is not met (and homoscedasticity also fails), we prioritize robust / non-parametric methods to compare groups (Kruskal–Wallis) and quantify differences (bootstrapping).
- Implication for regression: in OLS, what matters is the behavior of the residuals (linearity, constant variance). That is why later we evaluate residuals and apply a log-salary transformation to stabilize variance.
In short: non-normality does not invalidate the analysis, but it conditions the choice of methods and keeps our methodological decisions traceable.
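To make the sample-size sensitivity concrete, here is a quick sketch with synthetic data (the skewed population is identical in both samples; only n changes — the distribution, seed, and function name are illustrative, not part of the notebook):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def mildly_skewed(n):
    # Same mildly right-skewed population, drawn at different sample sizes
    return rng.normal(0, 1, n) + 0.5 * rng.exponential(1.0, n)

_, p_small = stats.normaltest(mildly_skewed(200))
_, p_large = stats.normaltest(mildly_skewed(100_000))
print(f"n=200:     p = {p_small:.3f}")
print(f"n=100,000: p = {p_large:.3e}")
```

The deviation from normality is the same in both runs, yet the p-value collapses as n grows — which is why, with N > 14,000, we read the rejection together with the EDA (skewness) rather than as a standalone verdict.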
4.2 Homoscedasticity Assessment (Equal Variances)¶
Since we have already seen that the data are not normally distributed, Bartlett’s test is not reliable. Instead, we apply Levene’s test, which is more robust because it does not assume normality.
def check_homoscedasticity(df, group_col, target_col):
# Prepare groups
groups = [df[df[group_col] == g][target_col] for g in df[group_col].unique()]
# Levene test (center='median' is more robust for skewed data)
stat, p_value = stats.levene(*groups, center='median')
print(f"--- Levene test ({group_col}) ---")
print(f"W statistic: {stat:.4f}")
print(f"p-value: {p_value:.4e}")
if p_value < 0.05:
print("✅ CONCLUSION: Variances are significantly different (heteroscedasticity).")
else:
print("❌ CONCLUSION: No evidence that variances differ.")
# Check for experience and company size
check_homoscedasticity(df_clean, 'experience_level', 'salary_usd')
check_homoscedasticity(df_clean, 'company_size', 'salary_usd')
--- Levene test (experience_level) --- W statistic: 835.7112 p-value: 0.0000e+00 ✅ CONCLUSION: Variances are significantly different (heteroscedasticity). --- Levene test (company_size) --- W statistic: 16.2998 p-value: 8.4926e-08 ✅ CONCLUSION: Variances are significantly different (heteroscedasticity).
Interpretation: what does it mean if variances are different?¶
When Levene’s test rejects equality of variances, it tells us that salary dispersion is not the same across groups. This adds nuance beyond “who earns more”:
- Which group varies more? A higher variance (or standard deviation / IQR) indicates a wider salary band: more heterogeneity in compensation within that segment (e.g., negotiation, specialization, location, bonuses).
- What does it imply for inference? With heteroscedasticity, parametric tests that assume equal variances (e.g., classic ANOVA) are less appropriate; robust or non-parametric alternatives become preferable (Kruskal–Wallis).
- What does it imply for modeling? In linear regression, heteroscedasticity breaks one of the Gauss–Markov assumptions (constant error variance), affecting the validity of standard errors and confidence intervals. That is why later we use a log transformation and residual diagnostics.
To make this concrete, we quantify dispersion by group below (variance, standard deviation, IQR, CV) to identify which category varies the most and what insight that adds to the analysis.
# Quantify dispersion by group to understand "who varies more" (not only who earns more)
def dispersion_table(df, group_col, target_col="salary_usd", observed=True):
"""Return a dispersion table by group (variance, std, IQR, CV)."""
g = df.groupby(group_col, observed=observed)[target_col]
summary = g.agg(
n="count",
mean="mean",
median="median",
std="std",
var="var",
q1=lambda x: x.quantile(0.25),
q3=lambda x: x.quantile(0.75),
)
summary["iqr"] = summary["q3"] - summary["q1"]
summary["cv"] = summary["std"] / summary["mean"]
return summary.sort_values("std", ascending=False)
for col in ["experience_level", "company_size"]:
disp = dispersion_table(df_clean, col, "salary_usd", observed=True)
display(disp)
g_max = disp["std"].idxmax()
g_min = disp["std"].idxmin()
std_ratio = disp.loc[g_max, "std"] / disp.loc[g_min, "std"]
print(
f"➡️ Highest dispersion (STD) in '{col}': {g_max} "
f"(std={disp.loc[g_max,'std']:,.0f} | IQR={disp.loc[g_max,'iqr']:,.0f} | CV={disp.loc[g_max,'cv']:.2f})"
)
print(
f"➡️ Lowest dispersion (STD) in '{col}': {g_min} "
f"(std={disp.loc[g_min,'std']:,.0f} | IQR={disp.loc[g_min,'iqr']:,.0f} | CV={disp.loc[g_min,'cv']:.2f})"
)
print(f"📈 Dispersion ratio (std max / std min): {std_ratio:.2f}")
# Quick visualization: standard deviation by group
plt.figure(figsize=(8, 3))
disp["std"].sort_values().plot(kind="barh")
plt.title(f"Salary standard deviation by group: {col}")
plt.xlabel("STD(salary_usd)")
plt.ylabel(col)
plt.show()
| n | mean | median | std | var | q1 | q3 | iqr | cv | |
|---|---|---|---|---|---|---|---|---|---|
| experience_level | |||||||||
| EX | 3277 | 171357.764724 | 167899.0 | 42645.767697 | 1.818662e+09 | 138802.0 | 203278.0 | 64476.0 | 0.248870 |
| SE | 3741 | 122187.657845 | 116907.0 | 35261.617316 | 1.243382e+09 | 94173.0 | 145315.0 | 51142.0 | 0.288586 |
| MI | 3781 | 87955.471833 | 84641.0 | 25543.109400 | 6.524504e+08 | 67621.0 | 104483.0 | 36862.0 | 0.290410 |
| EN | 3718 | 63133.377084 | 60373.5 | 18558.096806 | 3.444030e+08 | 48516.0 | 74866.5 | 26350.5 | 0.293951 |
➡️ Highest dispersion (STD) in 'experience_level': EX (std=42,646 | IQR=64,476 | CV=0.25) ➡️ Lowest dispersion (STD) in 'experience_level': EN (std=18,558 | IQR=26,350 | CV=0.29) 📈 Dispersion ratio (std max / std min): 2.30
| n | mean | median | std | var | q1 | q3 | iqr | cv | |
|---|---|---|---|---|---|---|---|---|---|
| company_size | |||||||||
| L | 4721 | 120178.613429 | 111160.0 | 51196.951526 | 2.621128e+09 | 80483.0 | 155554.0 | 75071.0 | 0.426007 |
| M | 4855 | 108280.711843 | 96759.0 | 50461.054640 | 2.546318e+09 | 70841.5 | 139181.0 | 68339.5 | 0.466021 |
| S | 4941 | 99750.428658 | 88895.0 | 48242.994196 | 2.327386e+09 | 63454.0 | 128067.0 | 64613.0 | 0.483637 |
➡️ Highest dispersion (STD) in 'company_size': L (std=51,197 | IQR=75,071 | CV=0.43) ➡️ Lowest dispersion (STD) in 'company_size': S (std=48,243 | IQR=64,613 | CV=0.48) 📈 Dispersion ratio (std max / std min): 1.06
Interpreting the dispersion table: who varies more and why does it matter?¶
Once we compute variance / standard deviation / IQR / coefficient of variation (CV) by group, we can extract conclusions such as:
- Highest dispersion: the group with the largest STD/IQR has the widest salary range, which typically reflects stronger heterogeneity (mixed roles, broader seniority within the same label, bonuses, etc.).
- Lowest dispersion: the group with the smallest dispersion has more standardized pay bands (more homogeneous positions and clearer market pricing).
Therefore, higher dispersion within certain groups suggests heterogeneous career paths and compensation structures, possibly reflecting differences in seniority, specialization, negotiation power, or firm-level policies rather than a single dominant salary trajectory.
This is important because two groups can have different medians and very different uncertainty around those medians. In practice, that translates to wider vs. narrower salary “bands” for negotiation and benchmarking.
4.3 Overall Analysis: Kruskal–Wallis¶
Assumptions check (parametric requirements)¶
Before choosing a global test, we verified the main ANOVA assumptions:
- Normality: rejected (p < 0.05 in D’Agostino’s test).
- Homoscedasticity (equal variances): rejected (p < 0.05 in Levene’s test).
Because these parametric assumptions are violated, and because we need to compare more than two independent groups, we discard the one-way ANOVA.
Theoretical note: The Mann–Whitney U test is only appropriate for comparing two independent groups.
When the goal is to compare more than two groups in a non-parametric and robust way, we use the Kruskal–Wallis test, which generalizes Mann–Whitney to the multi-group setting.
Why Kruskal–Wallis?¶
Given the non-normality and heteroscedasticity, we adopt the Kruskal–Wallis test as the non-parametric alternative to ANOVA.
It evaluates whether the salary distributions differ across several groups (e.g., experience levels). We apply it to each candidate predictor to confirm whether it has a global effect on salary.
Hypotheses (Kruskal–Wallis)¶
- H₀: The salary distributions are the same across all groups (often interpreted as “equal medians”).
- H₁: At least one group differs (i.e., at least one distribution is systematically shifted relative to another).
Technical note (interpretation): Although Kruskal–Wallis is commonly described as a comparison of medians, strictly speaking it tests for stochastic dominance / distributional differences.
If group distributions have different shapes (which is plausible given the variance differences suggested by Levene’s test), a significant result means that for at least one pair of groups the probability that a randomly chosen observation from one group exceeds a randomly chosen observation from another is systematically different from 0.5.
This interpretation is more general and robust than “median differences” alone.
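This "probability that one group exceeds the other" reading can be computed directly from the Mann–Whitney U statistic as U/(n₁·n₂), the common-language effect size. A minimal sketch on synthetic salary-like samples (illustrative data and names, not the notebook's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic log-normal "salary" samples with a shifted location
group_low = rng.lognormal(mean=11.0, sigma=0.4, size=500)
group_high = rng.lognormal(mean=11.3, sigma=0.4, size=500)

u_stat, p_val = stats.mannwhitneyu(group_low, group_high, alternative='two-sided')
# P(random draw from group_low > random draw from group_high)
prob_superiority = u_stat / (len(group_low) * len(group_high))
print(f"U = {u_stat:.0f} | p = {p_val:.3e} | P(low > high) ≈ {prob_superiority:.3f}")
```

A value well below 0.5 means a random member of the lower group rarely out-earns a random member of the higher group — exactly the kind of systematic shift Kruskal–Wallis detects across several groups at once.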
def test_kruskal_global(df, group_col, target_col):
groups = [df[df[group_col] == val][target_col] for val in df[group_col].unique()]
stat, p_value = stats.kruskal(*groups)
print(f"Kruskal–Wallis test for variable: {group_col}")
print(f" H statistic: {stat:.4f} | p-value: {p_value:.4e}")
if p_value < 0.05:
print(f" ✅ H0 rejected: '{group_col}' significantly affects salary.")
else:
print(f" ❌ H0 not rejected: we cannot claim significant differences.")
print("-" * 98)
# Apply to our two candidate predictors
test_kruskal_global(df_clean, 'experience_level', 'salary_usd')
test_kruskal_global(df_clean, 'company_size', 'salary_usd')
Kruskal–Wallis test for variable: experience_level H statistic: 9368.7891 | p-value: 0.0000e+00 ✅ H0 rejected: 'experience_level' significantly affects salary. -------------------------------------------------------------------------------------------------- Kruskal–Wallis test for variable: company_size H statistic: 449.3941 | p-value: 2.6021e-98 ✅ H0 rejected: 'company_size' significantly affects salary. --------------------------------------------------------------------------------------------------
# Effect size for Kruskal–Wallis: how large is the global difference?
def kruskal_effect_size(df, group_col, target_col="salary_usd", observed=True):
groups = [g[target_col].values for _, g in df.groupby(group_col, observed=observed)]
H, p = stats.kruskal(*groups)
n = sum(len(x) for x in groups)
k = len(groups)
# Epsilon squared (ε²) is a common effect size for Kruskal–Wallis
epsilon_sq = (H - k + 1) / (n - k)
# Alternative: approximate eta squared (η²) for reference
eta_sq = H / (n - 1)
return H, p, epsilon_sq, eta_sq, n, k
for col in ["experience_level", "company_size"]:
H, p, eps2, eta2, n, k = kruskal_effect_size(df_clean, col, "salary_usd", observed=True)
print(f"\nEffect size for: {col}")
print(f"H = {H:,.2f} | p = {p:.2e} | n = {n} | k = {k}")
print(f"ε² (epsilon squared) = {eps2:.4f}")
print(f"η² (approx.) = {eta2:.4f}")
    # Rule of thumb for ε²/η² (indicative): 0.01 small, 0.06 medium, 0.14 large
    if eps2 < 0.01:
        mag = "negligible"
    elif eps2 < 0.06:
        mag = "small"
    elif eps2 < 0.14:
        mag = "medium"
    else:
        mag = "large"
    print(f"➡️ Interpretation (indicative): {mag} effect for {col}.")
Effect size for: experience_level H = 9,368.79 | p = 0.00e+00 | n = 14517 | k = 4 ε² (epsilon squared) = 0.6453 η² (approx.) = 0.6454 ➡️ Interpretation (indicative): large effect for experience_level. Effect size for: company_size H = 449.39 | p = 2.60e-98 | n = 14517 | k = 3 ε² (epsilon squared) = 0.0308 η² (approx.) = 0.0310 ➡️ Interpretation (indicative): small effect for company_size.
Effect size: how large is the difference?¶
Statistical significance tells us whether differences exist, but not how large they are. Therefore, we compute an effect size for Kruskal–Wallis (e.g., $\varepsilon^2$ or an approximate $\eta^2$) to quantify the proportion of variance explained by each categorical predictor.
This answers: Is the factor merely significant, or also meaningful in magnitude?
4.4 Specific Analysis (Post-hoc): Quantifying Differences (Bootstrapping)¶
The global Kruskal–Wallis test confirms that significant differences exist across groups, but it does not tell us where those differences are or how large they are.
To deepen the analysis, we run a post-hoc study using bootstrapping. This technique allows us to build a 95% Confidence Interval (CI) for the difference in central tendency between key pairs (typically the difference in means, and optionally the difference in medians). This provides two complementary benefits:
1. Statistical Significance: If the CI for the difference does not include 0, we interpret it as evidence of a real disparity between groups (conceptually equivalent to a p-value < 0.05).
2. Economic / Business Quantification: Unlike a p-value, the CI yields an interpretable range in dollar terms (e.g., “they earn between X and Y more”), which is crucial for business decision-making and compensation strategy.
All confidence intervals are constructed via 95% bootstrapping.
Why bootstrapping?
- It is non-parametric and does not rely on normality assumptions.
- It produces an intuitive confidence interval for the difference (mean and/or median).
- It is robust to skewness and outliers, and easy to interpret in business terms.
Note: Although bootstrapping is often used with small samples, we apply it here despite having N > 14,000 because its non-parametric nature helps produce confidence intervals that respect the true asymmetry of the salary distribution, without imposing assumptions of normality or symmetry.
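Conceptually, the percentile bootstrap can be sketched by hand on synthetic samples (illustrative data and variable names; the actual analysis uses scipy.stats.bootstrap on df_clean):

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-ins for two salary groups
sample_a = rng.normal(120_000, 35_000, size=300)  # e.g., a "Senior-like" group
sample_b = rng.normal(88_000, 25_000, size=300)   # e.g., a "Mid-like" group

n_boot = 5_000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    # Resample each group with replacement and recompute the statistic
    res_a = rng.choice(sample_a, size=sample_a.size, replace=True)
    res_b = rng.choice(sample_b, size=sample_b.size, replace=True)
    boot_diffs[i] = res_a.mean() - res_b.mean()

# The 2.5th and 97.5th percentiles of the resampled statistic form the 95% CI
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
print(f"95% percentile CI for the mean difference: [{ci_low:,.0f}, {ci_high:,.0f}]")
```

scipy.stats.bootstrap automates this resampling loop (with better interval corrections), but the logic is the same: if the resulting interval excludes 0, the difference is significant.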
Questions to answer (pairs of interest): We apply this technique to the comparisons we consider most informative:
- Experience: How much more does a Senior (SE) profile earn compared to a Mid-Level (MI) profile?
- Company size: Do Medium (M) companies actually pay more than Small (S) companies?
def bootstrap_difference(df, group_col, control_group, test_group, target='salary_usd'):
# Data filtering
data_control = df[df[group_col] == control_group][target].values
data_test = df[df[group_col] == test_group][target].values
# Statistic function (Difference in means: Test - Control)
# Vectorized ('np.mean(axis=-1)' is faster and more technically robust than 'data.mean()')
def diff_means(x, y, axis=-1):
return np.mean(y, axis=axis) - np.mean(x, axis=axis)
# Bootstrap execution
# We set 'random_state' for reproducibility
res = bootstrap(
(data_control, data_test),
diff_means,
confidence_level=0.95,
random_state=24
)
ci = res.confidence_interval
mean_diff = np.mean(data_test) - np.mean(data_control)
print(f"\n📊 BOOTSTRAP: {group_col} ({test_group} vs {control_group})")
print(f" Observed Mean Difference: ${mean_diff:,.2f}")
print(f" Confidence Interval (95%): [${ci.low:,.2f}, ${ci.high:,.2f}]")
if ci.low > 0:
print(f" ✅ CONCLUSION: Significant positive difference ({test_group} earns more).")
elif ci.high < 0:
print(f" ✅ CONCLUSION: Significant negative difference ({control_group} earns more).")
else:
print(" ❌ CONCLUSION: Not significant (the interval includes 0).")
# Specific experience analysis (Senior vs Mid)
bootstrap_difference(df_clean, 'experience_level', 'MI', 'SE')
# Specific company size analysis (Medium vs Small)
# We compare M vs S because it represents the first growth jump from a startup
bootstrap_difference(df_clean, 'company_size', 'S', 'M')
📊 BOOTSTRAP: experience_level (SE vs MI) Observed Mean Difference: $34,232.19 Confidence Interval (95%): [$32,857.99, $35,623.00] ✅ CONCLUSION: Significant positive difference (SE earns more). 📊 BOOTSTRAP: company_size (M vs S) Observed Mean Difference: $8,530.28 Confidence Interval (95%): [$6,578.70, $10,509.18] ✅ CONCLUSION: Significant positive difference (M earns more).
# Statistical significance vs practical relevance: absolute and relative difference (%)
def pair_practical(df, group_col, group_a, group_b, target_col="salary_usd"):
a = df.loc[df[group_col] == group_a, target_col]
b = df.loc[df[group_col] == group_b, target_col]
mean_a, mean_b = a.mean(), b.mean()
med_a, med_b = a.median(), b.median()
diff_mean = mean_a - mean_b
pct_mean = diff_mean / mean_b * 100
diff_med = med_a - med_b
pct_med = diff_med / med_b * 100
print(f"\n📌 Practical comparison: {group_col} ({group_a} vs {group_b})")
print(f"Mean: {group_a} = {mean_a:,.0f} | {group_b} = {mean_b:,.0f} | Δ = {diff_mean:,.0f} ({pct_mean:.1f}%)")
print(f"Median: {group_a} = {med_a:,.0f} | {group_b} = {med_b:,.0f} | Δ = {diff_med:,.0f} ({pct_med:.1f}%)")
# Use df_clean (not the raw df) so this comparison is consistent with the bootstrap above
pair_practical(df_clean, "experience_level", "SE", "MI", "salary_usd")
pair_practical(df_clean, "company_size", "M", "S", "salary_usd")
📌 Practical comparison: experience_level (SE vs MI) Mean: SE = 122,188 | MI = 87,955 | Δ = 34,232 (38.9%) Median: SE = 116,907 | MI = 84,641 | Δ = 32,266 (38.1%) 📌 Practical comparison: company_size (M vs S) Mean: M = 108,281 | S = 99,750 | Δ = 8,530 (8.6%) Median: M = 96,759 | S = 88,895 | Δ = 7,864 (8.8%)
A “practical” reading of the results (beyond the confidence interval):
- Here we do not only confirm that the difference is different from 0; we also express it as an absolute Δ (USD) and a relative Δ (percentage).
- In a business context, this translation into impact is key: a statistically significant difference might not be relevant if it is small in relative terms—and vice versa.
Experience (SE vs MI):
The 95% confidence interval for the difference in means is above 0, confirming that Senior profiles earn more than Mid-Level profiles.
Company size (M vs S):
The 95% confidence interval is also above 0, indicating that Medium companies pay more than Small ones.
Note: In the previous cell we also report the relative (%) difference in mean and median to contextualize practical relevance (not just statistical significance).
5. Correlation and Association Analysis¶
In this analysis, the target variable (salary_usd) is numerical, while the predictors (experience_level and company_size) are ordinal categorical variables.
Following the course methodology, when working with ordinal qualitative variables against a numerical target, we approach the analysis from two complementary perspectives to ensure statistical rigor and compliance with assumptions:
Trend and Direction (Spearman Correlation):
- We convert categories into numerical ranks (encoding).
- We use Spearman’s correlation (instead of Pearson’s) because it is more robust to non-normal data and is specifically designed to evaluate monotonic relationships (increasing or decreasing) between ranked variables. This allows us to confirm the direction of the relationship.
True Effect Strength (Association Metrics):
- Eta-squared ($\eta^2$): Used to quantify the actual effect size, measuring the percentage of salary variance explained by each variable (without assuming strict linearity). This metric complements the Kruskal–Wallis test and helps assess the relevance of each variable as a predictor.
- Cramér’s V: Used to validate independence between predictors and rule out multicollinearity.
Specifically, to assess whether experience_level and company_size are associated with each other (i.e., potential categorical multicollinearity), we apply the Chi-square test and compute Cramér’s V, which normalizes the strength of association (0 = none, 1 = perfect) between nominal or ordinal variables.
Together, these metrics allow us to statistically validate both the usefulness and non-redundancy of the variables before incorporating them into the regression model.
# TREND ANALYSIS (Spearman Correlation)
print("--- SPEARMAN CORRELATION MATRIX (Direction) ---")
# Prepare the data -> numeric encoding while preserving the hierarchy (Junior < Senior)
df_corr = df_clean.copy()
df_corr['exp_code'] = df_corr['experience_level'].cat.codes
df_corr['size_code'] = df_corr['company_size'].cat.codes
# Compute correlation (method='spearman' due to non-normality and ordinal nature; it is also more robust to outliers)
corr_matrix = df_corr[['salary_usd', 'exp_code', 'size_code']].corr(method='spearman')
# Visualization
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', fmt=".2f", vmin=-1, vmax=1)
plt.title("Spearman Correlation")
plt.show()
--- SPEARMAN CORRELATION MATRIX (Direction) ---
Interpretation of Spearman’s correlation
- salary_usd ↔ exp_code (ρ = 0.80): There is a strong, positive monotonic correlation: as experience level increases, the salary trend increases as well. This is consistent with the test results and with the effect size we will compute next using η².
- salary_usd ↔ size_code (ρ = 0.17): The relationship is weak and positive. Larger companies show a slightly higher salary trend, although the strength of the association is low (in line with η² ≈ 0.03, shown below).
- exp_code ↔ size_code (ρ = −0.03): The correlation is virtually zero. Company size and experience level behave independently, with no meaningful monotonic relationship. This confirms there is no risk of categorical multicollinearity between the predictors.
# EFFECT SIZE AND ASSOCIATION ANALYSIS (Advanced Metrics)
# First, we define the functions we will need
def compute_eta_squared(df, cat_col, num_col):
"""
Computes effect size (Eta-squared) using the Kruskal–Wallis H statistic.
Interpretation: % of variance in the numerical variable explained by the categorical variable.
"""
groups = [df[df[cat_col] == val][num_col] for val in df[cat_col].unique()]
H, _ = stats.kruskal(*groups)
k = len(groups) # k = number of groups
n = len(df) # n = total number of observations
eta_sq = (H - k + 1) / (n - k)
return max(0, eta_sq)
"""
More detailed interpretation:
- Numerator (H - k + 1)
Adjusts the H statistic by removing the “noise” introduced by having multiple groups
- Denominator (n - k)
Normalizes the result so it becomes a proportion between 0 and 1
- Result
A value between 0 and 1, interpreted as:
“Proportion of salary variability explained by the categorical variable”
"""
def compute_cramers_v(df, col1, col2):
"""
Computes Cramér’s V based on the Chi-square statistic.
Measures the association between two categorical variables (0 = none, 1 = perfect).
"""
confusion_matrix = pd.crosstab(df[col1], df[col2])
chi2 = stats.chi2_contingency(confusion_matrix)[0]
n = confusion_matrix.sum().sum()
# Normalize chi2
phi2 = chi2 / n # chi2 measures deviation from independence, and n = total sample size
r, k = confusion_matrix.shape # r = number of rows and k = number of columns
# Bias correction (recommended for more robust results)
with np.errstate(divide='ignore', invalid='ignore'):
# Subtract bias to avoid inflating association
phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
# Adjust “effective categories” to reduce error due to small cells
rcorr = r - ((r - 1) ** 2) / (n - 1)
kcorr = k - ((k - 1) ** 2) / (n - 1)
# Corrected formula -> robust V between 0 and 1
v = np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
return v
# RUN OUR ANALYSIS
print("--- EFFECT SIZE ON SALARY (Eta-squared) ---")
# Evaluate how much each predictor influences salary
for col in ['experience_level', 'company_size']:
eta = compute_eta_squared(df_clean, col, 'salary_usd')
print(f"Variable '{col}':")
print(f" -> η² = {eta:.4f} (Explains {eta*100:.1f}% of the salary variance)")
print("\n--- RELATIONSHIP BETWEEN PREDICTORS (Multicollinearity - Cramér's V) ---")
# Evaluate whether predictors provide redundant information
v_cramer = compute_cramers_v(df_clean, 'experience_level', 'company_size')
print("Association between 'experience_level' and 'company_size':")
print(f" -> V = {v_cramer:.4f}")
# Automatic interpretation
if v_cramer < 0.3:
print(" ✅ CONCLUSION: Low association. Predictors provide unique information (No multicollinearity risk).")
elif v_cramer < 0.7:
print(" ⚠️ CONCLUSION: Moderate association. Watch for possible redundancy (Potential multicollinearity risk).")
else:
print(" ❌ CONCLUSION: Very high association. Likely severe multicollinearity.")
--- EFFECT SIZE ON SALARY (Eta-squared) --- Variable 'experience_level': -> η² = 0.6453 (Explains 64.5% of the salary variance) Variable 'company_size': -> η² = 0.0308 (Explains 3.1% of the salary variance) --- RELATIONSHIP BETWEEN PREDICTORS (Multicollinearity - Cramér's V) --- Association between 'experience_level' and 'company_size': -> V = 0.0370 ✅ CONCLUSION: Low association. Predictors provide unique information (No multicollinearity risk).
6. Predictive Modeling: Multiple Linear Regression¶
Once we have confirmed the statistical relevance of the predictors (experience_level and company_size) and ruled out multicollinearity, we proceed to build a Multiple Linear Regression (OLS) model.
The goal is to obtain a mathematical equation that allows us to estimate the expected salary ($Y$) as a function of experience level ($X_1$) and company size ($X_2$):
$$\text{Salary} = \beta_0 + \beta_1 \cdot (\text{Experience}) + \beta_2 \cdot (\text{Company Size}) + \epsilon$$
6.1 Variable Preparation: Dummy Encoding¶
To avoid imposing an artificial metric on our categorical variables (i.e., incorrectly assuming that the salary distance between Junior and Mid is the same as between Senior and Executive), we will use Dummy Variables (One-Hot Encoding).
This allows the model to estimate an independent coefficient (a “salary premium”) for each level, without forcing linearity constraints across categories.
The general equation we aim to estimate is:
$$Y = \beta_0 + \underbrace{\beta_1 D_{MI} + \beta_2 D_{SE} + \beta_3 D_{EX}}_{\text{Experience Levels}} + \underbrace{\beta_4 D_{M} + \beta_5 D_{L}}_{\text{Company Size}} + \epsilon$$
# VARIABLE PREPARATION (DUMMIES)
# NOTE:
# There are two common ways to do this in statsmodels:
# A) Using formulas (smf.ols): "salary_usd ~ C(experience_level) + C(company_size)"
# More automatic because it creates dummies internally
# B) Using arrays (sm.OLS): requires creating dummies manually via pd.get_dummies
# We'll use option (B) to keep full control over the columns and be explicit.
# Create dummy variables (one-hot encoding)
# drop_first=True removes one category as the reference (avoids perfect multicollinearity)
# dtype=int ensures 0/1 instead of True/False (avoids dtype issues in statsmodels)
X_dummies = pd.get_dummies(df_clean[['experience_level', 'company_size']], drop_first=True, dtype=int)
# Add constant (intercept) for OLS
X = sm.add_constant(X_dummies)
# Define target variable (y) and its log transform
y = df_clean['salary_usd']
y_log = np.log(df_clean['salary_usd']) # log transform to correct heteroscedasticity
print("Variables ready. Model columns:", X.columns.tolist())
# Verify numeric types
print("\nData types in X:")
print(X.dtypes)
Variables ready. Model columns: ['const', 'experience_level_MI', 'experience_level_SE', 'experience_level_EX', 'company_size_M', 'company_size_L']

Data types in X:
const                  float64
experience_level_MI      int32
experience_level_SE      int32
experience_level_EX      int32
company_size_M           int32
company_size_L           int32
dtype: object
6.2 Model Selection: Linear vs. Logarithmic¶
To ensure the validity of our predictions, we must satisfy the Gauss–Markov assumptions, especially Homoscedasticity (the error variance should be constant).
First, we fit a baseline linear model on the raw salary target ($Y$) to diagnose residual behavior and evaluate whether transformations are needed.
# PLAIN LINEAR MODEL (diagnostic baseline)
# Fit the model without log transform
model_linear = sm.OLS(y, X).fit()
print("--- LINEAR MODEL DIAGNOSTICS (NO LOG) ---")
print(f"R-squared: {model_linear.rsquared:.4f}")
print(f"Log-Likelihood: {model_linear.llf:.2f}")
print(f"AIC: {model_linear.aic:.2f}")
print(f"BIC: {model_linear.bic:.2f}")
# Residual plot to detect heteroscedasticity
residuals_linear = model_linear.resid
plt.figure(figsize=(10, 6))
plt.scatter(model_linear.fittedvalues, residuals_linear, alpha=0.5, color='gray')
plt.title("Residual diagnostics (linear model)", fontsize=14)
plt.xlabel("Predicted salary", fontsize=12)
plt.ylabel("Residuals (error)", fontsize=12)
plt.axhline(0, color='red', linestyle='--', linewidth=2)
plt.show()
--- LINEAR MODEL DIAGNOSTICS (NO LOG) ---
R-squared: 0.6522
Log-Likelihood: -170192.26
AIC: 340396.52
BIC: 340442.01
Heteroscedasticity Correction: Logarithmic Transformation¶
After inspecting the residuals of the linear model, we detect a “fan-shaped” pattern (heteroscedasticity): the errors increase as salary increases. To correct this, we apply a logarithmic transformation to the dependent variable ($\log(Y)$).
This changes the model interpretation: the coefficients ($\beta$) will represent percentage changes in salary rather than fixed monetary amounts, which better matches real-world labor market dynamics.
$$\log(\text{Salary}) = \beta_0 + \beta X + \epsilon$$
# LOG MODEL
# Fit the model using log salary
model_log = sm.OLS(y_log, X).fit()
# Show the full statistical summary
print(model_log.summary())
# --- ASSUMPTION DIAGNOSTICS (Gauss–Markov validation) ---
residuals = model_log.resid
fitted_vals = model_log.fittedvalues
fig, ax = plt.subplots(1, 2, figsize=(15, 6))
# -> Homoscedasticity (Residuals vs Fitted)
# We look for a shapeless "cloud". If there is no "fan" pattern, great!
sns.scatterplot(x=fitted_vals, y=residuals, ax=ax[0], alpha=0.6)
ax[0].axhline(0, color='red', linestyle='--', linewidth=2)
ax[0].set_title('Homoscedasticity', fontsize=12)
ax[0].set_xlabel('Predicted log(Salary)')
ax[0].set_ylabel('Residuals')
# -> Normality (Q-Q Plot)
# We look for blue points to follow the red line
sm.qqplot(residuals, line='45', fit=True, ax=ax[1])
ax[1].set_title('Normality (Q-Q Plot)', fontsize=12)
plt.tight_layout()
plt.show()
print("Interpretation:")
print("1. Left plot: if we see a dispersed cloud with no fan shape, we have corrected heteroscedasticity.")
print("2. Right plot: if the points align with the red line, residuals are approximately normally distributed.")
OLS Regression Results
==============================================================================
Dep. Variable: salary_usd R-squared: 0.679
Model: OLS Adj. R-squared: 0.678
Method: Least Squares F-statistic: 6127.
Date: Wed, 14 Jan 2026 Prob (F-statistic): 0.00
Time: 21:35:05 Log-Likelihood: -1170.5
No. Observations: 14517 AIC: 2353.
Df Residuals: 14511 BIC: 2399.
Df Model: 5
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
const 10.9016 0.005 2071.176 0.000 10.891 10.912
experience_level_MI 0.3288 0.006 54.258 0.000 0.317 0.341
experience_level_SE 0.6576 0.006 108.225 0.000 0.646 0.669
experience_level_EX 1.0173 0.006 161.743 0.000 1.005 1.030
company_size_M 0.0998 0.005 18.818 0.000 0.089 0.110
company_size_L 0.2345 0.005 43.853 0.000 0.224 0.245
==============================================================================
Omnibus: 1233.878 Durbin-Watson: 1.986
Prob(Omnibus): 0.000 Jarque-Bera (JB): 396.771
Skew: 0.067 Prob(JB): 6.95e-87
Kurtosis: 2.201 Cond. No. 5.29
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpretation:
1. Left plot: if we see a dispersed cloud with no fan shape, we have corrected heteroscedasticity.
2. Right plot: if the points align with the red line, residuals are approximately normally distributed.
# ADDITIONAL RESIDUAL PLOTS (Histogram and Boxplot)
fig, ax = plt.subplots(1, 2, figsize=(15, 6))
# Histogram
sns.histplot(residuals, bins=30, kde=True, ax=ax[0])
ax[0].set_title("Residuals histogram", fontsize=12)
ax[0].set_xlabel("Error (residuals)")
# Boxplot (to see outliers)
sns.boxplot(x=residuals, ax=ax[1])
ax[1].set_title("Residuals boxplot", fontsize=12)
ax[1].set_xlabel("Error distribution")
plt.show()
# Shape metrics
print(f"Skewness: {residuals.skew():.4f}")
print(f"Kurtosis: {residuals.kurtosis():.4f}")
Skewness: 0.0673
Kurtosis: -0.7985
6.3 Detailed Interpretation of Statistical Results¶
Below, we break down all the metrics returned by statsmodels and the residual plots to validate the model’s robustness and understand its impact on our study.
- No. Observations: 14,517 → number of rows
- Df Residuals: 14,511 (14,517 − 6 estimated parameters → 14,511 degrees of freedom)
- Df Model: 5 (number of predictors, i.e., the dummy columns in X excluding the constant)
- Covariance Type: nonrobust (heteroscedasticity-robust standard errors were not requested)
1. Goodness of Fit and Information Criteria¶
R-squared (0.679): The model explains 67.9% of the variance in log-salary. Loosely speaking, nearly 70% of (proportional) salary differences are explained purely by Experience and Company Size.
Adj. R-squared (0.678): Practically identical to R², confirming that we did not include irrelevant predictors.
F-statistic (6127) and Prob (F-statistic) (0.00): The p-value is effectively zero (< 0.05). We can confidently conclude the model is not due to chance; the predictors have a real relationship with salary. The model is statistically significant overall.
Log-Likelihood (−1170.5), AIC (Akaike Information Criterion: 2353) and BIC (Bayesian Information Criterion: 2399):
- These values measure model fit while penalizing complexity.
- Compared to the linear model (AIC ≈ 340,000), the drop to 2,353 looks dramatic. However, AIC and log-likelihood are not directly comparable when the dependent variable is transformed (log vs. raw dollars), so this gap should be read as suggestive rather than conclusive; the residual diagnostics are the stronger evidence that the log model captures the data structure far better than the raw linear model.
2. Economic Interpretation of the Coefficients ($\beta$)¶
Since this is a log-level model ($\log(Y) = \beta X$), coefficients are interpreted as approximate percentage changes relative to the baseline salary:
Intercept (const = 10.90): Baseline log value for an Entry-level employee in a Small company.
- Translation: $e^{10.90} \approx \$54,176$. This is the estimated baseline salary.
experience_level (the main driver):
- Mid-Level ($\beta \approx 0.33$): Salary increases by ~39% relative to the baseline ($e^{0.33} - 1$).
- Senior ($\beta \approx 0.66$): Salary increases by ~93% relative to the baseline ($e^{0.66} - 1$).
- Executive ($\beta \approx 1.02$): Salary increases by ~177% ($e^{1.02} - 1$), a substantial jump as expected at these organizational levels.

company_size (a secondary but meaningful effect):
- Medium ($\beta \approx 0.10$): a ~10% premium over a small company, holding experience constant.
- Large ($\beta \approx 0.23$): Working at a large company implies a ~26% salary premium ($e^{0.23} - 1$) relative to a small one, holding experience constant.
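These percentage premiums follow mechanically from the regression output via $e^{\beta} - 1$; a quick arithmetic check (coefficients copied from the summary table above):

```python
import numpy as np

# Coefficients from the fitted log model (see the OLS summary)
coefs = {
    "experience_level_MI": 0.3288,
    "experience_level_SE": 0.6576,
    "experience_level_EX": 1.0173,
    "company_size_M": 0.0998,
    "company_size_L": 0.2345,
}

baseline = np.exp(10.9016)  # intercept -> Entry level, Small company
print(f"Baseline salary: ${baseline:,.0f}")
for name, beta in coefs.items():
    # exp(beta) - 1 converts a log-scale coefficient into a percentage premium
    print(f"{name}: +{(np.exp(beta) - 1) * 100:.1f}% over baseline")
```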
3. Assumption Validation (Residual Diagnostics)¶
To confirm model reliability, we combine statistical tests with visual inspection of residual plots:
A) Homoscedasticity and Independence
- Durbin–Watson (1.986): Near the ideal value (~2), indicating no autocorrelation in the errors.
- Cond. No. (Condition Number) (5.29): Assesses multicollinearity. Values > 30 may indicate multicollinearity (though not necessarily severe). A value of 5.29 (well below 30) is excellent, confirming what we saw with Cramér’s V: predictors are independent and the model is stable. There is no concerning multicollinearity.
- Residuals vs. Fitted scatter plot: The point cloud appears stable and evenly spread along the horizontal axis. We have successfully removed the “fan-shaped” pattern present in the linear model, satisfying the constant-variance assumption.
B) Residual Normality
Numeric tests (Jarque–Bera (JB) = 396, Prob = 0.00 and Omnibus = 1233, Prob = 0.00): These reject perfect normality (Prob < 0.05), largely because the sample is very large ($N > 14,000$), making the tests highly sensitive to small deviations.
Q–Q plot: Excellent fit in the center (where most observations lie). Deviations in the tails indicate the model is highly accurate for typical salaries but less precise for extremely high or low salaries.
Histogram and boxplot:
- The histogram shows a fairly symmetric bell shape (Skew ≈ 0.06), indicating errors are not strongly biased to one side.
- The boxplot reveals some isolated outliers, explaining why Jarque–Bera rejects strict normality, even though the bulk of the distribution behaves well.
Skew (0.067): Indicates near-perfect symmetry (very close to 0). Errors are evenly distributed around zero, which is excellent.
Kurtosis (2.201): statsmodels reports kurtosis on the scale where a normal distribution equals 3. A value of 2.2 indicates a platykurtic distribution (flatter than normal, with somewhat lighter tails overall), although some tail outliers still exist.
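The two kurtosis figures in this section (2.201 in the statsmodels summary vs. −0.80 from pandas' `.kurtosis()`) are the same quantity under different conventions: statsmodels reports Pearson kurtosis (normal = 3), while pandas and scipy's default report Fisher "excess" kurtosis (normal = 0). A quick demonstration:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(7)
x = rng.normal(size=200_000)

k_pearson = kurtosis(x, fisher=False)  # ~3 for a normal sample
k_fisher = kurtosis(x, fisher=True)    # ~0 for a normal sample
print(f"Pearson: {k_pearson:.3f}, Fisher (excess): {k_fisher:.3f}")
# The two conventions differ by exactly 3, so 2.201 - 3 ≈ -0.80 as printed earlier
```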
Validation Conclusion: The model is robust. Although some outliers generate deviations in the tails, the symmetry and homoscedasticity in the core of the data ensure that coefficient estimates and confidence intervals are valid for the vast majority of the population.
Modeling Conclusions¶
After the full fitting and validation process, we conclude:
- Valid model: Despite mild non-normality in the tails (highlighted by Jarque–Bera and the boxplot), the model satisfies the critical assumptions of homoscedasticity, linearity, and error symmetry. It is a reliable estimation tool.
- Factor hierarchy: Experience is statistically proven to be the dominant driver of salary, with exponential jumps at Senior/Executive levels. Company size acts as a secondary but significant multiplier.
- Predictive capability: With an $R^2$ close to 68%, the model provides highly accurate salary-range estimates for most of the market, although caution is needed for extreme cases (detected outliers).
6.4 Prediction and Confidence Intervals¶
Finally, we use the optimized logarithmic model to generate predictions. Since a single point prediction is not sufficient for decision-making, we will quantify the associated uncertainty in two complementary ways:
- Analytical interval: The classical computation of a 95% interval for a specific case (a prediction interval for a single new employee, which is wider than the confidence interval for the mean salary).
- Robust validation (resampling): We will simulate 2,000 random train/test splits (Monte Carlo cross-validation, referred to here as bootstrapping) to estimate the model's true Root Mean Squared Error (RMSE). This will tell us, with 95% confidence, how far off our model is on average when predicting the salary of a new employee.
# A) PREDICTION FOR A HYPOTHETICAL CASE
# Let's imagine a new case: a 'Senior' (SE) in a 'Large' (L) company
# Initialize all predictors to 0 (explicit float dtype avoids object-dtype issues in predict)
new_employee = pd.DataFrame(np.zeros((1, len(X.columns))), columns=X.columns)
new_employee['const'] = 1.0
# Activate the corresponding dummy variables
if 'experience_level_SE' in X.columns:
    new_employee['experience_level_SE'] = 1
if 'company_size_L' in X.columns:
    new_employee['company_size_L'] = 1
# Prediction and inverse log-transform (np.exp)
pred_log = model_log.predict(new_employee)[0]
pred_usd = np.exp(pred_log)
# Analytical 95% interval (obs_ci_* is a prediction interval for a single new observation)
pred_summary = model_log.get_prediction(new_employee).summary_frame(alpha=0.05)
ci_lower = np.exp(pred_summary['obs_ci_lower'][0])
ci_upper = np.exp(pred_summary['obs_ci_upper'][0])
print("--- CASE PREDICTION: Senior in a Large Company ---")
print(f"Estimated salary: ${pred_usd:,.2f}")
print(f"Interval (95%): [${ci_lower:,.2f} - ${ci_upper:,.2f}]")
# B) ROBUST VALIDATION (BOOTSTRAPPING)
rmse_list = []
n_iter = 2000
# Set a seed for reproducibility
np.random.seed(24)
print(f"\nRunning bootstrapping ({n_iter} iterations)...")
for _ in tqdm(range(n_iter)):
    # Split into train/test (no random_state here: each iteration draws a fresh
    # split from the globally seeded NumPy RNG, keeping the run reproducible)
    X_train, X_test, y_train, y_test = train_test_split(X, y_log, test_size=0.2)
    # Train and predict
    m_boot = sm.OLS(y_train, X_train).fit()
    preds_log = m_boot.predict(X_test)
    # Convert back to real USD to compute the true error
    rmse = np.sqrt(mean_squared_error(np.exp(y_test), np.exp(preds_log)))
    rmse_list.append(rmse)
# Compute mean and percentiles
rmse_mean = np.mean(rmse_list)
rmse_lower = np.percentile(rmse_list, 2.5)
rmse_upper = np.percentile(rmse_list, 97.5)
print(f"Global mean RMSE: ${rmse_mean:,.2f}")
print(f"Error range (95%): [${rmse_lower:,.0f} - ${rmse_upper:,.0f}]")
--- CASE PREDICTION: Senior in a Large Company ---
Estimated salary: $132,412.64
Interval (95%): [$79,169.73 - $221,462.28]

Running bootstrapping (2000 iterations)...
100%|██████████| 2000/2000 [00:18<00:00, 108.63it/s]
Global mean RMSE: $30,146.95
Error range (95%): [$29,360 - $30,976]
6.5 3D Visualization (Error Visualization)¶
To visually inspect prediction performance, we build a 3D plot comparing:
- Experience level (x-axis)
- Company size (y-axis)
- Actual vs predicted salary (z-axis)
This helps identify where errors concentrate (e.g., at the extremes).
# Prepare a sample and create numeric codes on the fly (also set a seed)
# We use sample(500) to visualize density without overcrowding the plot
df_viz = df_clean.sample(500, random_state=24).copy()
# Since we defined the logical ordering (Entry -> Executive) at the beginning of the analysis,
# we can use .cat.codes to automatically respect that ordering
df_viz['exp_code_viz'] = df_viz['experience_level'].cat.codes
df_viz['size_code_viz'] = df_viz['company_size'].cat.codes
# Generate predictions for this sample
# Recreate the dummy variables for the model
X_viz = pd.get_dummies(df_viz[['experience_level', 'company_size']], drop_first=True, dtype=int)
X_viz = sm.add_constant(X_viz)
# Fill missing columns with 0 (in case the sample is missing a category)
for col in X.columns:
    if col not in X_viz.columns:
        X_viz[col] = 0
X_viz = X_viz[X.columns] # reorder columns to match the original model
# Predict
pred_log_viz = model_log.predict(X_viz)
df_viz['pred_usd'] = np.exp(pred_log_viz)
# Build the INTERACTIVE 3D PLOT
fig = go.Figure()
# A) Actual points (Blue)
fig.add_trace(go.Scatter3d(
x=df_viz['exp_code_viz'],
y=df_viz['size_code_viz'],
z=df_viz['salary_usd'],
mode='markers',
marker=dict(size=5, color='blue', opacity=0.7),
name="Actual Data"
))
# B) Predicted points (Orange)
fig.add_trace(go.Scatter3d(
x=df_viz['exp_code_viz'],
y=df_viz['size_code_viz'],
z=df_viz['pred_usd'],
mode='markers',
marker=dict(size=5, color='orange', opacity=0.9, symbol='diamond'),
name="Model Prediction"
))
# C) Residual lines (Red)
for x_real, y_real, z_real, z_pred in zip(
    df_viz['exp_code_viz'],
    df_viz['size_code_viz'],
    df_viz['salary_usd'],
    df_viz['pred_usd']
):
    fig.add_trace(go.Scatter3d(
        x=[x_real, x_real],
        y=[y_real, y_real],
        z=[z_real, z_pred],
        mode='lines',
        line=dict(color='red', width=2),
        showlegend=False  # hide from the legend
    ))
# Final layout configuration with axis text labels
fig.update_layout(
title="Prediction Errors: Actual vs Model (Sample of 500 cases)",
scene=dict(
xaxis=dict(title="Experience Level", tickvals=[0, 1, 2, 3], ticktext=["Entry", "Mid", "Senior", "Executive"]),
yaxis=dict(title="Company Size", tickvals=[0, 1, 2], ticktext=["Small", "Medium", "Large"]),
zaxis_title="Salary ($)"
),
width=900,
height=700
)
fig.show()
3D Plot Interpretation: The blue points represent the actual observed values, and the orange points represent the model’s predictions.
- “Stair-step” effect: Since the predictors are categorical, the model outputs a fixed predicted value for each category combination (e.g., Senior + Large). Visually, this creates vertical “columns” where the blue points (real salaries) cluster around the orange points (predicted salaries).
- Residuals (Red lines): The length of each red line represents the individual prediction error. We can see the model performs much better for low and mid salary ranges (short lines), but the error grows substantially for executive profiles (longer red lines), confirming the presence of outliers and higher variance among the highest salaries.
6.6 Model Uncertainty Analysis¶
Finally, we quantify predictive uncertainty via bootstrapping: we repeatedly resample the data, fit the model, and compute RMSE on held-out points.
This provides a distribution of the expected error and a robust 95% uncertainty range for model performance.
# Convert the bootstrapped error list to a DataFrame
errors_df = pd.DataFrame({"rmse": rmse_list})
# RMSE histogram
plt.figure(figsize=(10, 6))
errors_df['rmse'].hist(bins=30, edgecolor='black', alpha=0.7)
plt.title("Distribution of mean error (RMSE) across 2000 simulations")
plt.xlabel("Mean error in dollars ($)")
plt.ylabel("Frequency")
plt.axvline(errors_df['rmse'].mean(), color='red', linestyle='dashed', linewidth=2, label='Mean')
plt.legend()
plt.show()
# Distribution statistics
print(f"Skewness of error: {errors_df['rmse'].skew():.4f}")
print(f"Kurtosis of error: {errors_df['rmse'].kurtosis():.4f}")
# Confidence interval
mean_error = errors_df['rmse'].mean()
std_error = errors_df['rmse'].std()
lower = np.percentile(errors_df['rmse'], 2.5)
upper = np.percentile(errors_df['rmse'], 97.5)
print("\n--- FINAL ACCURACY CONCLUSION (Percentile method) ---")
print(f"The model's mean RMSE is ${mean_error:,.0f}.")
print("With 95% confidence, the prediction error will lie within:")
print(f"[{lower:,.0f}, {upper:,.0f}]")
print(f"This shows high model stability, with a standard deviation of only ${std_error:,.0f} across simulations.")
Skewness of error: 0.0438
Kurtosis of error: 0.0910

--- FINAL ACCURACY CONCLUSION (Percentile method) ---
The model's mean RMSE is $30,147.
With 95% confidence, the prediction error will lie within:
[29,360, 30,976]
This shows high model stability, with a standard deviation of only $412 across simulations.
7. Final Conclusions¶
After conducting an exhaustive statistical analysis on a sample of more than 14,000 records from the AI job market in 2025, we arrive at the following conclusions, grounded in data-driven evidence and predictive modeling:
1. Hierarchy of Salary Drivers¶
Both the association analysis and the regression model converge on a key finding:
- Experience is the primary driver: Experience level (experience_level) alone explains 64.5% of salary variability ($\eta^2 \approx 0.65$). The salary jump from Mid-Level to Senior is substantial, and it becomes even more pronounced at Executive levels.
- Company size is secondary: Although the effect is statistically significant, company size (company_size) explains only about 3% of the variability ($\eta^2 \approx 0.03$). While large companies (L) tend to pay more than small ones (S), this factor is far less decisive than a candidate's experience.
2. Market Behavior¶
- Positive skewness: Salary distributions are not normal. There is a strong concentration in low-to-mid ranges and a long right tail of extremely high salaries—typical of a fast-growing technology sector.
- Uneven dispersion across segments: Beyond mean differences, salary variability differs across categories (Levene’s test). This points to wider salary bands in certain segments (e.g., negotiation power, specialization, location, bonuses), reinforcing the need to report intervals and rely on robust methods.
- Independence of factors: Using Cramér’s V (0.03), we show there is no meaningful association between company size and experience level. Small firms also hire Senior profiles, and large firms also hire Juniors—without a strong structural bias.
3. Reliability of the Predictive Model¶
For salary prediction, we discarded the simple linear model due to heteroscedasticity and opted for a log-linear model (OLS on log-salary), yielding robust results:
Explanatory power: The model achieves an $R^2$ of 67.9%, meaning it explains nearly 7 out of every 10 dollars of salary differences using only these two variables.
Accuracy: Through bootstrap-based cross-validation (2,000 iterations), we estimate the model’s average error (RMSE) at approximately $30,147 USD.
- Interpretation: With an average salary around $115k, a $30k error is acceptable for general market estimation, though it indicates that additional variables (e.g., specific skills or geographic location) likely explain the remaining ~32% of variance.
4. Executive Summary¶
With 95% statistical confidence, we conclude that the most effective strategy to maximize salary in the AI sector is to prioritize vertical growth (experience level) over company type. After the logarithmic transformation, the model satisfies the Gauss–Markov assumptions, making it a valid tool for estimating fair salary ranges based on seniority and corporate context—excluding extreme outliers.